Image Captioning project

Data loading

The dataset we are using is the VizWiz dataset. The original dataset contains 39,181 images, each annotated with five captions.

Our aim is to develop a model that, given an image, returns a sentence that describes it.

The original training set is large (23,431 images), so we chose to train on the original validation set (7,750 images) instead. Since the captions of the original test set are not publicly released, we test on a subset of the training set (1,171 images).

Defining Vocabulary class

Below, we define a vocabulary class. It is constructed from the caption corpus of our training dataset. We build two dictionaries to map words to numerical indices and vice versa.

As text preprocessing, for the moment, we only lowercase all words. Other preprocessing and cleaning techniques, such as lemmatization, stemming, or removing non-alphanumeric characters, could also be applied.

Moreover, when building the vocabulary, we keep only the words whose corpus frequency meets a given threshold.
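The vocabulary described above could be sketched as follows. This is an illustrative implementation, not the project's exact code; the special-token names and the `freq_threshold` parameter are our own choices.

```python
from collections import Counter

class Vocabulary:
    """Maps words to indices and back, keeping only frequent words.
    Sketch: token names and defaults are illustrative assumptions."""

    def __init__(self, freq_threshold=5):
        self.freq_threshold = freq_threshold
        # special tokens: padding, start-of-sentence, end-of-sentence, unknown
        self.itos = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.stoi = {w: i for i, w in self.itos.items()}

    @staticmethod
    def tokenize(text):
        # minimal preprocessing: lowercase and split on whitespace
        return text.lower().split()

    def build(self, captions):
        # count every token in the caption corpus
        counts = Counter(tok for cap in captions for tok in self.tokenize(cap))
        idx = len(self.itos)
        for word, freq in counts.items():
            # keep only words that reach the frequency threshold
            if freq >= self.freq_threshold:
                self.stoi[word] = idx
                self.itos[idx] = word
                idx += 1

    def numericalize(self, text):
        # unseen or rare words fall back to <unk>
        return [self.stoi.get(tok, self.stoi["<unk>"]) for tok in self.tokenize(text)]
```

With a threshold of 2, a word seen once maps to `<unk>` while frequent words get their own index.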

Defining the Vizwiz Dataset class

Here we define a custom dataset class that makes data loading and training more convenient.
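A dataset of this kind could look like the sketch below. The sample format (a list of `(image_path, caption)` pairs) and the vocabulary interface are assumptions for illustration; the real VizWiz annotations are stored as JSON.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset

class VizWizDataset(Dataset):
    """Pairs each image with one numericalized caption.
    Sketch: the (image_path, caption) sample format is an assumption."""

    def __init__(self, samples, vocab, transform=None):
        self.samples = samples        # [(image_path, caption), ...]
        self.vocab = vocab
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption = self.samples[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # numericalize the caption, wrapped in <sos>/<eos> markers
        tokens = ([self.vocab.stoi["<sos>"]]
                  + self.vocab.numericalize(caption)
                  + [self.vocab.stoi["<eos>"]])
        return image, torch.tensor(tokens)
```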

Defining pad batch class and the dataloader
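Captions in a batch have different lengths, so the collate class pads them to a common length before stacking. A minimal sketch, assuming the `<pad>` token has index 0:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

class PadCollate:
    """Pads variable-length caption tensors in a batch to a common length.
    Sketch: pad_idx=0 assumes <pad> is the first vocabulary entry."""

    def __init__(self, pad_idx=0):
        self.pad_idx = pad_idx

    def __call__(self, batch):
        # stack fixed-size image tensors, pad variable-length captions
        images = torch.stack([item[0] for item in batch])
        captions = pad_sequence([item[1] for item in batch],
                                batch_first=True,
                                padding_value=self.pad_idx)
        return images, captions
```

An instance of this class is passed to the `DataLoader` as its `collate_fn`.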

The Model Architecture

Our captioning model is a Seq2Seq model. The encoder uses a pretrained CNN (a ResNet) to extract image features. The decoder is LSTM-based and applies an attention mechanism between the feature maps produced by the encoder and the decoder hidden states. More specifically, we used an implementation of the Bahdanau attention decoder.

Encoder

The encoder takes as input a 224×224 image and produces a 7×7 grid of spatial features, i.e. 49 feature vectors of dimension 2048 each.

Bahdanau Attention block
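Bahdanau (additive) attention scores each encoder feature against the current decoder hidden state, then forms a context vector as the weighted sum of the features. A minimal sketch; the dimensions are illustrative, not the project's exact hyperparameters:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention between encoder features and the decoder state.
    Sketch: enc_dim/dec_dim/attn_dim defaults are illustrative."""

    def __init__(self, enc_dim=2048, dec_dim=512, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, 49, enc_dim); hidden: (B, dec_dim)
        scores = self.v(torch.tanh(self.W_enc(features)
                                   + self.W_dec(hidden).unsqueeze(1)))  # (B, 49, 1)
        alpha = torch.softmax(scores, dim=1)     # weights over the 49 positions
        context = (alpha * features).sum(dim=1)  # (B, enc_dim)
        return context, alpha.squeeze(-1)
```

The returned `alpha` weights are what we later visualize over the image.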

Decoder

The decoder has an LSTM-based architecture enriched with an attention mechanism. At each decoding step, the decoder receives a context vector computed from the interaction between its current hidden state and all the encoder feature vectors. To initialize the LSTM hidden state and cell memory, we average the encoder feature maps.
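Putting those pieces together, the decoding loop could look like the sketch below. The attention computation is inlined to keep the example self-contained, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """LSTM decoder with inlined additive attention.
    Sketch: dimensions and layer names are illustrative assumptions."""

    def __init__(self, vocab_size, enc_dim=2048, dec_dim=512,
                 emb_dim=256, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # additive attention parameters
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)
        # initial LSTM state from the mean of the encoder features
        self.init_h = nn.Linear(enc_dim, dec_dim)
        self.init_c = nn.Linear(enc_dim, dec_dim)
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, dec_dim)
        self.fc = nn.Linear(dec_dim, vocab_size)

    def forward(self, features, captions):
        # features: (B, 49, enc_dim); captions: (B, T) token ids
        mean_feat = features.mean(dim=1)
        h, c = self.init_h(mean_feat), self.init_c(mean_feat)
        embeds = self.embed(captions)            # (B, T, emb_dim)
        outputs = []
        for t in range(captions.size(1)):
            # additive attention between h and all encoder features
            scores = self.v(torch.tanh(self.W_enc(features)
                                       + self.W_dec(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)
            context = (alpha * features).sum(dim=1)   # (B, enc_dim)
            # feed the current word embedding concatenated with the context
            h, c = self.lstm(torch.cat([embeds[:, t], context], dim=1), (h, c))
            outputs.append(self.fc(h))
        return torch.stack(outputs, dim=1)       # (B, T, vocab_size)
```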

Whole Architecture: Encoder + Decoder

Finally, we put all the previous blocks together to build our final model.
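The composition itself is a thin wrapper: encode the image, then decode a caption from the features. A sketch, assuming any encoder/decoder pair with the interfaces described above:

```python
import torch
import torch.nn as nn

class ImageCaptioner(nn.Module):
    """Wires an encoder and an attention decoder together.
    Sketch: works with any modules matching the interfaces above."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, images, captions):
        features = self.encoder(images)           # (B, 49, 2048)
        return self.decoder(features, captions)   # (B, T, vocab_size)
```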

Model

Training or loading a pretrained model

You can choose whether to load trained model weights or train a new model. We saved three training checkpoints, which we made available here: https://drive.google.com/drive/folders/1cpSWwWRYtvuLwaElMz5EwepsBygWkdev

Visualizing the attention weights

Defining helper functions

  • Given an image, generate a caption and the attention scores
  • Overlay the attention scores on the image
  • Testing on new images

    Since we trained on the VizWiz validation set, we use a subset of the VizWiz training set as a test set to evaluate our model's performance.

    Now, we calculate the BLEU score over the entire test set. To do so, we take the average of the BLEU scores of the individual test samples.

    Let us compare this to an untrained model, i.e. a model with random weights.
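The per-sample BLEU averaging described above could be sketched as follows, assuming NLTK is available; the function name and smoothing choice are our own:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def average_bleu(references_list, hypotheses):
    """Mean sentence-level BLEU over the test set (sketch).
    references_list[i] holds the tokenized reference captions for
    sample i; hypotheses[i] is the tokenized generated caption."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    scores = [sentence_bleu(refs, hyp, smoothing_function=smooth)
              for refs, hyp in zip(references_list, hypotheses)]
    return sum(scores) / len(scores)
```

With VizWiz, each `references_list[i]` would hold the five ground-truth captions of sample i.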